WB Project Data Preprocessing

Author

Luisa M. Mimmi

Published

September 13, 2024

Work in progress

Set up

# Pckgs -------------------------------------
library(fs) # Cross-Platform File System Operations Based on 'libuv'
library(tidyverse) # Easily Install and Load the 'Tidyverse'
library(janitor) # Simple Tools for Examining and Cleaning Dirty Data
library(skimr) # Compact and Flexible Summaries of Data
library(here) # A Simpler Way to Find Your Files
library(paint) # paint data.frames summaries in colour
library(readxl) # Read Excel Files
library(tidytext) # Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools
library(SnowballC) # Snowball Stemmers Based on the C 'libstemmer' UTF-8 Library

library(rsample) # General Resampling Infrastructure

—————————————————————————-

Data sources

WB Projects & Operations

World Bank Projects & Operations can be explored at:

  1. Data Catalog
  2. Advanced Search (within the Data Catalog)

Raw data

—————————————————————————

Attempt # 1 {Ingest Projects data (via API)}

DOESN’T WORK!

Attempt # 2.a {Ingest Projects data (manually split)}

On 09/22/2022 I manually retrieved ALL WB projects approved between FY 1973 and FY 2023 (last FY incomplete; WDRs cover 1978-2022) using this example url, and saved the individual .xlsx files in data/raw_data/project

  • note: the manual download is limited to 500 records per file

— Load all .xlsx files separately

— Save objs in folder as .Rds files separately
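The two load/save steps above could be sketched like this (a sketch under my own assumptions about the folder layout; `save_chunks_as_rds` is a hypothetical helper, not part of the original pipeline):

```r
# Sketch: read every manually downloaded .xlsx chunk in a folder and
# save each one next to itself as an .Rds file
save_chunks_as_rds <- function(dir) {
  xlsx_files <- list.files(dir, pattern = "\\.xlsx$", full.names = TRUE)
  for (f in xlsx_files) {
    obj <- readxl::read_excel(f)
    saveRDS(obj, sub("\\.xlsx$", ".Rds", f))
  }
}

# e.g. save_chunks_as_rds(here::here("data", "raw_data", "project"))
```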

Attempt # 2.b Ingest Projects data (manually all together)!

  • I manually retrieved ALL WB projects approved between FY 1947 and 2026, as of 31/08/2024, simply using the Excel button on this page: WBG Projects

  • then saved the resulting (huge) .xls file as data/raw_data/project2/all_projects_as_of29ago2024.xls

    • (plus an .Rdata copy of the original file)
all_projects_as_of29ago2024 <- read_excel(here::here("data", "raw_data", "project2", "all_projects_as_of29ago2024.xls"), 
                                            col_names = FALSE,
                                            skip = 1) 
# Column names
cnames <- read_excel(here::here("data", "raw_data", "project2", "all_projects_as_of29ago2024.xls"), 
                         col_names = FALSE,
                         skip = 1,
                     n_max = 2) 
# Full file
all_proj <- read_excel(here::here("data", "raw_data", "project2", "all_projects_as_of29ago2024.xls"), 
                         col_names = TRUE,
                         skip = 2) 

save(all_proj, file = here::here("data", "raw_data", "project2", "all_projects_as_of29ago2024.Rdata") ) 
rm(all_projects_as_of29ago2024)

Explore Project mega file

paint::paint(cnames)
paint::paint(all_proj)
 
skimr::skim(all_proj$pdo)  # complete_rate 0.503
skimr::skim(all_proj$boardapprovaldate) # complete_rate 0.779
skimr::skim(all_proj$closingdate) # complete_rate 0.707 

_______

CLEANING FILE PDOs

_______

Clean all_proj

This data set has a lot of blank values, probably also because some information (e.g. the PDO) was simply not collected as far back as 1947.

# Dates are formatted inconsistently in 2 columns: 
    # 1947-12-31 12:00:00    # closingdate
    # 8/3/1948 12:00:00 AM   # closingdate
    # 1955-03-15T00:00:00Z   # boardapprovaldate
 
# Mutate the date columns to parse the dates, handling different formats and blanks
all_proj_t <- all_proj %>%
   # 1) Parsed with parse_date_time(), including "ymd HMS" in the orders so that both
   #    "YYYY-MM-DD HH:MM:SS" and "MM/DD/YYYY HH:MM:SS AM/PM" closingdate formats are handled.
  mutate(across("closingdate", ~ if_else(
    . == "", 
    NA_POSIXct_,  # Return NA for blank entries
    parse_date_time(., orders = c("ymd HMS", "mdy HMS", "mdy HMSp"))
  )),
    # 2) Parsed using ymd_hms() because it follows the ISO 8601 format (e.g., "1952-04-29T00:00:00Z").
  across("boardapprovaldate", ~ if_else(
    . == "", 
    NA_POSIXct_,  # Return NA for blank entries
    ymd_hms(., tz = "UTC")  # Handle ISO 8601 format (e.g., "1952-04-29T00:00:00Z")
  ))) %>% 
   mutate(boardapproval_year = year(boardapprovaldate),
          boardapproval_month = month(boardapprovaldate)) %>% 
   mutate(boardapprovalFY = case_when( 
             boardapproval_month >= 1 & boardapproval_month < 7 ~ boardapproval_year,
             boardapproval_month >= 7 & boardapproval_month <= 12 ~ boardapproval_year +1)) %>% 
   relocate(boardapprovalFY, .after = boardapprovaldate ) %>% 
   mutate(closingdate_year = year(closingdate),
          closingdate_month = month(closingdate)) %>% 
   mutate(closingdateFY = case_when( 
             closingdate_month >= 1 & closingdate_month < 7 ~ closingdate_year,
             closingdate_month >= 7 & closingdate_month <= 12 ~ closingdate_year +1)) %>% 
   relocate(closingdateFY, .after = closingdate ) 

tabyl(all_proj$closingdate)   
tabyl(all_proj_t$closingdateFY)

tabyl(all_proj$boardapprovaldate)   
tabyl(all_proj_t$boardapprovalFY)
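The July-to-June fiscal-year rule encoded in the `case_when()` above can be checked in isolation (a base-R sketch; `wb_fiscal_year` is a hypothetical helper, not part of the pipeline):

```r
# World Bank fiscal years run July 1 to June 30, so any date in
# July-December belongs to the *next* fiscal year
wb_fiscal_year <- function(date) {
  m <- as.integer(format(date, "%m"))
  y <- as.integer(format(date, "%Y"))
  ifelse(m >= 7L, y + 1L, y)
}

wb_fiscal_year(as.Date("2023-06-30"))  # 2023 (last day of FY2023)
wb_fiscal_year(as.Date("2023-07-01"))  # 2024 (first day of FY2024)
```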

Explore who are the ones with no PDO

# Function to count missing values in a subset of columns
count_missing_values <- function(data, columns) {
  # Select the subset of columns
  df_subset <- data %>% select(all_of(columns))
  
  # Use skimr to skim the data
  skimmed <- skim(df_subset)
  
  # Extract the relevant columns for column names and missing values
  missing_table <- skimmed %>%
    select(skim_variable, n_missing)
  
  # Return the table
  return(missing_table)
}

# Use the function on a subset of columns
count_missing_values(all_proj_t, c("pdo", "projectstatusdisplay", "boardapprovalFY", "sector1", "theme1"))

missing_pdo <- all_proj_t %>% 
   #select(id, pdo, countryname, projectstatusdisplay, lendinginstr, boardapprovalFY, projectfinancialtype) %>% 
   filter(is.na(pdo))

# Now I compare to get a sense of distribution in all_proj_t v. missing_pdo... 
tabyl(all_proj_t$projectstatusdisplay) %>%  adorn_pct_formatting()
tabyl(missing_pdo$projectstatusdisplay) %>%  adorn_pct_formatting()

tabyl(all_proj_t$regionname)  %>% adorn_pct_formatting() 
tabyl(missing_pdo$regionname)  %>% adorn_pct_formatting() 

tabyl(all_proj_t$boardapprovalFY) %>%  adorn_pct_formatting()
tabyl(missing_pdo$boardapprovalFY) %>%  adorn_pct_formatting()

tabyl(all_proj_t$projectfinancialtype)  %>% adorn_pct_formatting() 
tabyl(missing_pdo$projectfinancialtype)  %>% adorn_pct_formatting() 

tabyl(all_proj_t$sector1)  %>% adorn_pct_formatting() 
tabyl(missing_pdo$sector1)  %>% adorn_pct_formatting() 

tabyl(all_proj_t$theme1)  %>% adorn_pct_formatting() 
tabyl(missing_pdo$theme1)  %>% adorn_pct_formatting() # most NA

#Environmental Assessment Category
tabyl(all_proj_t$envassesmentcategorycode)  %>% adorn_pct_formatting() # most NA
tabyl(missing_pdo$envassesmentcategorycode)  %>% adorn_pct_formatting() 
# Environmental and Social Risk
tabyl(all_proj_t$esrc_ovrl_risk_rate)  %>% adorn_pct_formatting() # most NA
tabyl(missing_pdo$esrc_ovrl_risk_rate)  %>% adorn_pct_formatting() 

tabyl(all_proj_t$lendinginstr)  %>% adorn_pct_formatting()  
tabyl(missing_pdo$lendinginstr)  %>% adorn_pct_formatting()  # Specific Investment Loan 4928   43.9%

Judging by a few “critical” categories, I would say that even though many projects are missing a PDO, the incidence seems to be random, except perhaps for lendinginstr: Specific Investment Loans account for 4928 (43.9%) of the projects missing a PDO. Why?


_______

PREPROCESSING

_______

Obtain Reduced df projs

For my purposes it is safe to drop all the projects with a missing PDO!

  • it turns out no Development Objectives were spelled out until FY 2001
projs <- all_proj_t %>% 
   filter(!is.na(pdo)) %>% 
   filter(!is.na(projectstatusdisplay)) %>% 
   filter(boardapprovalFY < 2024 & boardapprovalFY >1972)  %>% 
   select(id, pr_name = project_name, pdo, boardapprovalFY, closingdateFY,status = projectstatusdisplay, regionname, countryname, sector1, theme1 ,
   lendinginstr,env_cat = envassesmentcategorycode, ESrisk = esrc_ovrl_risk_rate ,curr_total_commitment  )   


tabyl(projs$boardapprovalFY)  %>% adorn_pct_formatting() 

nrow(projs) # 8836 
paint(cnames) 
paint(projs) 

rm(  "all_proj", "all_proj_t" , "cnames" , "count_missing_values", "missing_pdo"  )

Manual correction text [projs]

projs$pdo[projs$id == "P164414"] <- "The Multisector Development Policy Financing (DPF) intends to support Ukraine's highest priority reforms to move from economic stabilization to stronger and sustained economic growth by addressing deeper structural bottlenecks and governance challenges in key areas. Possible policy areas include : (i) strengthening private sector competitiveness, including reforming land markets and the financial sector; (ii) promoting sustainable and effective public services, including reforming pensions, social assistance, and health; and (iii ) improving governance, including reforming anticorruption institutions and tax administration. The financing  DPL or Policy Based Guarantee (PBG)."

projs$pdo[projs$id == "P111432"] <- "Project development objectives for RCIP 3 include the following: Malawi: Support the Recipient's efforts to improve the quality, availability and affordability of broadband within its territory for both public and private users. Mozambique: Support the Recipient's efforts to contribute to lower prices for international capacity and extend the geographic reach of broadband networks and to contribute to improved efficiency and transparency through eGovernment applications. Tanzania: Support the Recipient's efforts to: (i) lower prices for international capacity and extend the geographic reach of broadband networks; and (ii) improve the Government's efficiency and transparency through eGovernment applications."


projs$pdo[projs$id == "P252350"] <- "The Program Development Objective is to expand opportunities for the acquisition of quality, market-relevant skills in selected economic sectors. The selected economic sectors include Energy, Transport and Logistics, and Manufacturing (with a focus on ‘Made-Rwanda’ products such as construction materials, light manufacturing and agro-processing). Building skills to advance the country’s economic agenda is a key priority of the GoR’s ongoing Economic Development and Poverty Reduction Strategy-2 (EDPRS2) launched in 2013. EDPRS2 builds on the country’s Vision 2020 which seeks to transform the country by raising its per capita GDP to middle-incomelevel by 2020. The Program is grounded in the Government of Rwanda’s (GoR) National Employment Programs (NEP) approved by Cabinet in 2014. NEP was designed to address the employment challenges in Rwanda and equip its population with the skills required to supporteconomic development. The main results areas of the operation are: (i) reinforcing governance of the skills development system; (ii) ensuring provision of quality training programs with market relevance; (iii) expanding opportunities for continuous upgrading of job-relevant skills for sustained employability; and (iv) capacity building for implementation. The Program will disburse against achievement of specific Disbursement Linked Results (DLRs) in these results areas"

_______

SPLITTING SAMPLE

_______

(also to work on something smaller)

# ensure we always get the same result when sampling (for convenience )
set.seed(12345)

# use `regionname` as strata 
tabyl(projs$regionname)

projs_split <- projs %>%
  # 50% training, 25% validation, 25% testing
   rsample::initial_validation_split(prop = c(0.50, 0.25),
                                     # ensure the three sets are balanced across regions
                                     strata = regionname)

# resulting 3 datasets
projs_train <- rsample::training(projs_split)
projs_val <- rsample::validation(projs_split)
projs_test <- rsample::testing(projs_split)

tabyl(projs_train$regionname)
tabyl(projs_val$regionname)
tabyl(projs_test$regionname)

_______

TEXT ANALYSIS

_______

i) Tokenization

Where a word is more abstract, a “type” is a concrete term used in actual language, and a “token” is the particular instance we’re interested in (e.g. abstract things (‘wizards’) and individual instances of the thing (‘Harry Potter’)). Breaking a piece of text into words is thus called “tokenization”, and it can be done in many ways.

— The choices of tokenization

  1. Should words be lowercased? x
  2. Should punctuation be removed? x
  3. Should numbers be replaced by some placeholder?
  4. Should words be stemmed (also called lemmatization). x
  5. Should bigrams/multi-word phrase be used instead of single word phrases?
  6. Should stopwords (the most common words) be removed? x
  7. Should rare words be removed?

— Tokenization 1 PDO | regular expression

The R function strsplit (here via stringr::str_split) lets us do just this: split a string into pieces. Note, for example, that this makes the word “Don’t” into two words.

tok_simple <- projs_train$pdo[1] %>%
  str_split("[^A-Za-z]") # split on anything that isn't a letter between A and Z

str(tok_simple) # list of characters 
tok_simple[[1]]

— Tokenization 1 PDO | tidytext (ILLUSTRATION)

The simplest way is to remove anything that isn’t a letter. The workhorse function in tidytext is unnest_tokens. It creates a new column (here called ‘word’) holding each of the individual tokens in the text.

pdo_1 <- as_tibble(projs_train$pdo[1] )

# LIST OF features I can add to `unnest_tokens`
tok_feat_l <- list(
   # 1) all to lowercase (the default) 
   pdo_1 %>% unnest_tokens(word, value) %>% 
      select(lowercase = word),
   # 4) `SnowballC::wordStem` extracts the stem of each word (it is vectorized, so no rowwise() is needed)
   pdo_1 %>% unnest_tokens(word, value) %>% mutate(word = SnowballC::wordStem(word)) %>% 
      select(stemmed = word),
   # 1.b) keep uppercase letters where present 
   pdo_1 %>% unnest_tokens(word, value, to_lower = F) %>% 
      select(uppercase = word),
   # 2) keep punctuation {default is rid} 
   pdo_1 %>% unnest_tokens(word, value, to_lower = F, strip_punc = FALSE) %>% 
      select(punctuations = word),
   # 5) bigram
   pdo_1 %>% unnest_tokens(word, value, token = "ngrams", n = 2, to_lower = F) %>%
      select(bigrams = word)
)

# Return a data frame created by column-binding.
tok_feat_df <- map_dfc(tok_feat_l  , ~ .x %>%  head(50))
tok_feat_df

# # my choice 
# pdo_1_t_mod <- pdo_1 %>% 
#   # no punctuation, yes capitalized
#   unnest_tokens(word, value, to_lower = F, strip_punc = TRUE) %>% # 249 obs
#   # exclude stopwords 
#   anti_join(stop_words) # 109 obs
# 
# head(pdo_1_t_mod, 15)

— Tokenize train PDOs | tidytext

# pdo_train_token <- projs_train %>%  # 4416 
#    ungroup() %>%
#    # Drop some useless columns PDOs 
#    dplyr::select(id, boardapprovalFY, pr_name, regionname, pdo) %>% 
#    tidytext::unnest_tokens(output =  word,
#                            token = "words",
#                            input = pdo ,
#                            to_lower = T, # otherwise cannot match the stop_words  
#                            strip_punc = TRUE,
#                            drop = F # keep original text col (input)
#    ) %>% # 221,456
#    relocate (pdo, .before = "word") # 220,777

But I want to preserve hyphenated words, so that terms like community-based etc. remain intact:

pdo_train_token <- projs_train %>%  # 4416 
   ungroup() %>%
   # Drop some useless columns PDOs 
   dplyr::select(id, boardapprovalFY, pr_name, regionname, pdo) %>% 
   # Step 1: Replace hyphens with a placeholder (e.g., "HYPHEN")
  mutate(pdo_modified = str_replace_all(pdo, "-", "HYPHEN")) %>% 
# Step 2: Unnest tokens, with punctuation stripping (but hyphens are now preserved)
   tidytext::unnest_tokens(output =  word,
                           token = "words",
                           input = pdo_modified ,
                           to_lower = T, # otherwise cannot match the stop_words  
                           strip_punc = TRUE,
                           drop = F # keep original text col (input)
   ) %>% # 218,193
   select(-pdo_modified) %>%
   relocate (pdo, .before = "word")  %>% 
# Step 3: Replace the placeholder back with a hyphen
   mutate(word = str_replace_all(word, "hyphen", "-")) %>% 
   mutate(word = str_replace_all(word, "covid-19", "covid19"))  # 218,193


# Now `pdo_train_token` has hyphenated words preserved
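The placeholder round trip can be seen on a single toy string (a base-R sketch mirroring the three steps above):

```r
x <- "community-based and results-oriented"
x1 <- gsub("-", "HYPHEN", x)                         # 1) protect hyphens with a placeholder
toks <- tolower(unlist(strsplit(x1, "[^A-Za-z]+")))  # 2) tokenize and lowercase
toks <- gsub("hyphen", "-", toks)                    # 3) restore the hyphens
toks  # "community-based" "and" "results-oriented"
```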

— Default tidytext packg stop_words

# default stopwords that come with the tidytext package
sw <- tidytext::stop_words 
paint(stop_words)

— My own custom_stop_words |

Remove stop words, which are the most common words in a language.

# Custom list of articles, prepositions, and pronouns
custom_stop_words <- c(
   # Articles
   "the", "a", "an",   
   # Conjunctions and short prepositions
   "and", "but", "or", "yet", "so", "for", "nor", "as", "at", "by", "per",  
   # Prepositions
   "of", "in", "on", "at", "by", "with", "about", "against", "between", "into", "through", 
   "during", "before", "after", "above", "below", "to", "from", "up", "down", "under",
   "over", "again", "further", "then", "once",  
   # Pronouns
   "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your",
   "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", 
   "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves" ,
   "this", "that", "these", "those", "which", "who", "whom", "whose", "what", "where",
   "when", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other",
   # "some", "such", "no",  "not", 
   # "too", "very",   
      # verbs
   "is", "are", "would", "could", "will", "be"
)

# Convert to a data frame if needed for consistency with tidytext
custom_stop_words_df <- tibble(word = custom_stop_words)
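As a minimal base-R illustration of what the anti-join below does (toy vectors, not the real lists):

```r
stop_list <- c("the", "a", "will", "of", "to")
tokens <- c("the", "project", "will", "improve", "water", "supply")
# keep only the tokens that are NOT in the stop list
kept <- tokens[!tokens %in% stop_list]
kept  # "project" "improve" "water" "supply"
```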

— Remove stop_words from train PDOs

pdo_train_token <- pdo_train_token %>%  # 218,193
   # get rid of stop words (from the custom list)   
   anti_join(custom_stop_words_df, by = join_by(word)) # 140,741

# Count words
count_train <- pdo_train_token %>%
  count(word, sort = TRUE) # 11,200

— Other unwanted tokens

You may want to use your own curated list:

  • no numbers (not needed in this context)
  • no possessives like text's (turned into text)
  • no units of measurement

First, I check which they are…

# further restriction on the words
pdo_train_tok <- pdo_train_token %>%  # 142,195
   mutate(word_original = word) %>% 
   relocate(word_original, .after = pdo) %>% 
#### ------  NO `numbers` 
   # The regex "\\d" detects any digit (0-9); 
   # "^\\d*\\.?\\d+$" matches any string consisting only of digits (with an optional decimal point)
   mutate (word_num = str_detect(word_original, "^\\d*\\.?\\d+$")) %>% 
   # "^\\d{1,3}(,\\d{3})*(\\.\\d+)?$" matches numbers with no letters, allowing decimal points and thousands separators
   mutate (word_num2 = str_detect(word_original, "^\\d{1,3}(,\\d{3})*(\\.\\d+)?$"))  %>% 
#### ------  NO punctuation signs (except for hyphens)       
   # The regex "[[:punct:]]" matches any string containing at least one punctuation character;
   # the second condition exempts hyphenated words like "community-based"
   mutate (word_punct2 = str_detect(word_original, "[[:punct:]]") & !str_detect(word_original, "^[[:alpha:]]+-[[:alpha:]]+$")) %>% 
#### ------  NO hyphen with nothing else (redundant for above )
   mutate (word_hyp = str_detect(word_original, "^-$")) %>%
#### ------  NO units      
   mutate(word_units = str_detect(word_original, "\\b(usd|mw|gw|kwh|1,2,3)\\b")) %>% 
#### ------  `text's` (TURNED TO `text` )
   #### 1/2 .......  contains `'s` 
   mutate(word_s = str_detect(word_original, "\\b\\w+'s\\b")) %>%  
   #### 2/2 .......  looks for any word ending with 's and REPLACES it with JUST the word before the apostrophe
   mutate(word = str_replace_all(word_original, "\\b(\\w+)'s\\b", "\\1"))  
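A quick base-R check (on toy strings) that the number regexes above behave as intended:

```r
# digits with an optional decimal point
grepl("^\\d*\\.?\\d+$", c("2023", "3.5", "a1"))                 # TRUE TRUE FALSE
# digits with thousands separators and optional decimals
grepl("^\\d{1,3}(,\\d{3})*(\\.\\d+)?$", c("1,000.5", "10,00"))  # TRUE FALSE
```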

… then I get rid of the unwanted tokens

pdo_train_t <- pdo_train_tok %>% # 138,210
# get rid of numbers and other non meaningful words.... 
   filter (word_num == FALSE)  %>%    # ... >  139,406
   filter (word_num2 == FALSE)  %>%   # ... >  139,300
   filter (word_punct2 == FALSE)  %>% # ... >  137,693
   # filter (word_hyp == FALSE)  %>%    # ... >  139,144 (redudant with above)
   filter (word_units == FALSE)  %>%  # ... >  135,129
# DROP temporary cols
   select (-word_num, -word_num2, -word_punct2, -word_hyp, -word_units, -word_s)

# Count words
count_train <- pdo_train_t  %>%
  count(word, sort = TRUE) # 11,201--> 10,345

ii) Word stemming

Reduce words to their stem or root form.

  • this sucks! (e.g. “energy” becomes “energi”)
pdo_train_t <- pdo_train_t %>% 
   mutate(stem = wordStem(word))  # 135,129
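Running SnowballC's Porter stemmer on a few hand-picked words shows why the stems look rough:

```r
library(SnowballC)
# Porter stems are truncated forms, not real words
wordStem(c("energy", "improving", "development"))
# "energi"  "improv"  "develop"
```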

_______

TEXT ANALYSIS/SUMMARY

_______

_______

>>>>>> HERE <<<<<<<<<<<<<<<<<<

TO DO: review what I had done to clean up in analysis//03_WDR_pdotracs_explor.qmd

  • https://cengel.github.io/R-text-analysis/textprep.html#detecting-patterns
  • https://guides.library.upenn.edu/penntdm/r
  • https://smltar.com/stemming#how-to-stem-text-in-r (BOOK: STEMMING)

START FROM ## III.i) Tokenization

_______

see https://cengel.github.io/R-text-analysis/textanalysis.html

Frequencies of documents/words/stems

# Count words
counts_pdo <- pdo_train_t %>%
     count(pdo, sort = TRUE)  # 4,069

counts_words <- pdo_train_t %>%
     count(word, sort = TRUE)  # 10,366

counts_stems <- pdo_train_t %>%
  count(stem, sort = TRUE)   # 7,846

Word freq ggplot

pdo_train_t %>%
   filter (!(word %in% c("pdo","project", "development", "objective", "i","ii", "iii"))) %>%
   count(word) %>% 
   filter(n > 500) %>% 
   mutate(word = reorder(word, n)) %>%  # reorder values by frequency
   ggplot(aes(word, n)) +
   geom_col(fill = "gray") +
   coord_flip()  # flip x and y coordinates so we can read the words better

Stem freq ggplot

pdo_train_t %>%
   filter (!(stem %in% c("pdo","project", "development", "objective", "i","ii", "iii"))) %>%
   count(stem) %>% 
   filter(n > 500) %>% 
   mutate(stem = reorder(stem, n)) %>%  # reorder values by frequency
   ggplot(aes(stem, n)) +
   geom_col(fill = "gray") +
   coord_flip()  # flip x and y coordinates so we can read the words better

We can pipe the counts into ggplot to graph the words that occur more than 500 times: we count the words and use geom_col to represent the n values.

Isolate sector words and see frequency over years

df <- pdo_train_t %>%
   filter (stem %in% c("water", "transport", "urban", "energi", "health")) %>%
   mutate (FY = boardapprovalFY) %>%
   # group_by(FY) %>% 
   #summarize (n_rep = length(stem)) %>%
   count(FY,  stem) 

#df$FY

ggplot(data = df, aes(x = FY, y = n, group = stem, color = stem)) +
   geom_line() +
   geom_point() +
   scale_x_continuous(breaks =  seq(2001, 2023, by=  2)) +
   scale_color_viridis_d(option = "magma", end = 0.9) + 
   facet_wrap(~stem, ncol = 2, scales = "free") +   guides(color = "none") +
   theme_bw()+
   theme(# Adjust angle and alignment of x labels
      axis.text.x = element_text(angle = 45, hjust = 1)) + 
   labs(title = "Sector words frequency in PDO over Fiscal Years",x =   "Board approval FY", y = "Counts of 'sector' word (stem)") + 
   geom_vline(data = subset(df, stem == "health"), aes(xintercept = 2020), 
              linetype = "dashed", color = "#9b6723") +
   geom_text(data = subset(df, stem == "health"), aes(x = 2020, y = max(df$n)*0.85, label = "Covid"), 
             angle = 90, vjust = -0.5, color = "#9b6723")

Term frequency

Word and document frequency: Tf-idf

The goal is to quantify what a document is about.

  • term frequency (tf) = how frequently a word occurs in a document… but some words occur many times without being important
  • inverse document frequency (idf) = decreases the weight of commonly used words and increases the weight of words that are used rarely across a collection of documents
  • tf-idf statistic = the frequency of a term adjusted for how rarely it is used, an alternative to stop-word lists. [It measures how important a word is to a document in a collection (or corpus) of documents, but it is still a rule-of-thumb, heuristic quantity]

The tf-idf is the product of the term frequency and the inverse document frequency:
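Written out (with $N$ the number of documents in the corpus and $n_t$ the number of documents containing term $t$):

```latex
\mathrm{tf\mbox{-}idf}(t, d) \;=\; \mathrm{tf}(t, d) \times \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) \;=\; \ln\frac{N}{n_t}
```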

N-Grams

Co-occurrence

_______

TOPIC MODELING

_______

Topic modeling is an unsupervised machine-learning technique that exploratively identifies latent topics based on frequently co-occurring words.

It can identify topics or themes that occur in a collection of documents, allowing hidden patterns and relationships within text data to be discovered. It is widely applied in fields such as social sciences and humanities.

https://bookdown.org/valerie_hase/TextasData_HS2021/tutorial-13-topic-modeling.html

https://m-clark.github.io/text-analysis-with-R/topic-modeling.html

https://sicss.io/2020/materials/day3-text-analysis/topic-modeling/rmarkdown/Topic_Modeling.html

Document-Term Matrix

Latent Dirichlet Allocation (LDA)

… EXPLAINED at https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2

How do I include independent variables in my topic model?

https://bookdown.org/valerie_hase/TextasData_HS2021/tutorial-13-topic-modeling.html#how-do-i-include-independent-variables-in-my-topic-model

_______

STRUCTURAL TOPIC MODELING (STM)

_______

The Structural Topic Model is a general framework for topic modeling with document-level covariate information. The covariates can improve inference and qualitative interpretability and are allowed to affect topical prevalence, topical content or both.

  • MAIN REFERENCE: stm R package http://www.structuraltopicmodel.com/
  • EXAMPLE (UN corpus): https://content-analysis-with-r.com/6-topic_models.html
  • STM part 1/2: https://jovantrajceski.medium.com/structural-topic-modeling-with-r-part-i-2da2b353d362
  • STM part 2/2: https://jovantrajceski.medium.com/structural-topic-modeling-with-r-part-ii-462e6e07328

BERTopic

Developed by Maarten Grootendorst, BERTopic enhances the process of discovering topics by using document embeddings and a class-based variation of Term Frequency-Inverse Document Frequency (TF-IDF).

https://medium.com/@supunicgn/a-beginners-guide-to-bertopic-5c8d3af281e8

_______

(Dynamic) TOPIC MODELING OVER TIME

_______

Example: An analysis of Peter Pan using the R package koRpus https://ladal.edu.au/topicmodels.html#Topic_proportions_over_time